
Machine Learning Project

Machine Learning Analysis of Credit Card Approval.

This project was developed in collaboration with Paolo Caggiano for the Machine Learning course of the Master's Degree in Data Science.
Due to the increasing number of credit card users, it has become crucial for banks to differentiate between good and bad customers. Many financial institutions, such as national and private banks, rely on consumers' information: basic details, living standards, salary, yearly and monthly returns, and current income source. A thorough check and analysis of this information can spare institutions substantial technical and non-technical losses. The goal of this research is to identify those clients who, based on their financial history, could end up being a bad investment for the bank. We therefore want to investigate which variables play an important role in determining whether a client pays off a loan on time. This objective is pursued through classification, one of the main techniques in Machine Learning.

To answer these questions, we use a dataset from the Kaggle platform. After handling missing values and other pre-processing operations, we train the model with five algorithms: Logistic Regression, Random Forest, MultiLayer Perceptron, Naive Bayes, and Naive Bayes Tree. The performance of each is evaluated with appropriate metrics. To obtain more reliable estimates of the classifiers' performance measures, the subsequent analyses use k-fold cross-validation, which divides the data into k subsets.
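The project itself was built as a KNIME workflow, so any code here is only an analogous sketch. Below is a minimal scikit-learn version of the cross-validated comparison, assuming a pre-processed dataset with numeric features and a binary target (the file name and column names are hypothetical); the Naive Bayes Tree learner has no direct scikit-learn counterpart and is omitted.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

# Hypothetical pre-processed dataset: numeric features and a binary
# target (1 = bad client, 0 = good client).
df = pd.read_csv("credit_card_approval.csv")
X, y = df.drop(columns=["bad_client"]), df["bad_client"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "MultiLayer Perceptron": MLPClassifier(max_iter=500, random_state=42),
    "Naive Bayes": GaussianNB(),
}

# Stratified k-fold keeps the class ratio in every fold, which matters
# on an unbalanced dataset like this one.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f} +/- {scores.std():.3f}")
```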
To improve the model we applied further techniques. For example, the dataset was unbalanced; to address this we used both an oversampling method and the adoption of a cost matrix, as sketched below.
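Here is a minimal sketch of the two rebalancing strategies, again in scikit-learn rather than KNIME: the oversampler comes from the imbalanced-learn package, class weights stand in for the cost matrix, and the 1:5 cost ratio is purely illustrative.

```python
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Reusing X, y from the sketch above; hold out a test set first so the
# oversampled duplicates never leak into evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# 1) Oversampling: replicate minority-class rows until both classes
#    are equally represented in the training set.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)
clf_over = LogisticRegression(max_iter=1000).fit(X_res, y_res)

# 2) Cost-sensitive learning: weight errors on the rare "bad client"
#    class more heavily (a hypothetical 1:5 cost matrix).
clf_cost = LogisticRegression(max_iter=1000, class_weight={0: 1, 1: 5})
clf_cost.fit(X_train, y_train)
```

The design difference is that oversampling changes the training data while the cost matrix changes the loss; as reported in the conclusions, the cost-sensitive variant tended to perform better in this project.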
To obtain a better model with a faster learner, we also applied feature selection: the process of detecting relevant features and removing irrelevant, redundant, or noisy data.
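As a sketch of the idea, scikit-learn's univariate filter keeps the k features most associated with the target; k = 10 is an illustrative choice, and the project itself used KNIME's feature-selection loop instead.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Score each feature with an ANOVA F-test against the target, keep the
# top 10, then train the (now faster) learner on the reduced input.
pipe = make_pipeline(SelectKBest(score_func=f_classif, k=10),
                     LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print("F1 with 10 features:", f1_score(y_test, pipe.predict(X_test)))
```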

In conclusion, the increase in the F-score of all five algorithms over the holdout method shows that both oversampling and the adoption of a cost matrix are good ways to address the problem of an unbalanced dataset. We also found that cost-sensitive learning tends to achieve better results than the oversampling approach. Furthermore, this type of learning could be improved further by using a real cost matrix derived from the bank's data.
Regarding the individual algorithms, the Logistic Regression and Naive Bayes classifiers showed good performance across all of the approaches implemented. The Random Forest and Naive Bayes Tree classifiers, instead, tend to give better results when paired with a cost-sensitive learner. Lastly, the MultiLayer Perceptron achieved very good results using all of the available variables, especially when paired with the cost-sensitive learner, while suffering a large loss in performance when combined with a feature selector.

Tags

Machine Learning KNIME Model Training Cross-Validation Cost Matrix Oversampling Multilayer Perceptron Feature Selection